Making Asynchronous Stochastic Gradient Descent Work for Transformers
Asynchronous stochastic gradient descent (SGD) is attractive from a speed
perspective because workers do not wait for synchronization. However, the
Transformer model converges poorly with asynchronous SGD, resulting in
substantially lower quality compared to synchronous SGD. To understand why
this is the case, we isolate differences between asynchronous and synchronous
methods and examine batch size and staleness effects. We find that summing
several asynchronous updates, rather than applying them immediately, restores
convergence behavior. With this hybrid method, Transformer training for a neural
machine translation task reaches a near-convergence level 1.36x faster in
single-node multi-GPU training, with no impact on model quality.
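A minimal sketch of the hybrid update rule described above, assuming gradients arrive one by one from asynchronous workers as numpy arrays; the function name and arguments are illustrative and not taken from the paper:

import numpy as np

def apply_hybrid_updates(params, worker_grads, lr, accumulate_n):
    # Instead of applying each asynchronously computed gradient immediately,
    # sum `accumulate_n` of them and apply one combined step, which behaves
    # more like a larger synchronous batch and limits the effect of staleness.
    buffer = np.zeros_like(params)
    count = 0
    for grad in worker_grads:
        buffer += grad                 # accumulate incoming (possibly stale) updates
        count += 1
        if count == accumulate_n:
            params -= lr * buffer      # one combined, synchronous-like step
            buffer[:] = 0.0
            count = 0
    return params

Setting accumulate_n = 1 recovers plain asynchronous SGD, while larger values trade immediacy for behaviour closer to synchronous SGD with a larger effective batch.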
Approximating neural machine translation for efficiency
Neural machine translation (NMT) has been shown to outperform statistical machine
translation. However, NMT models typically require a large number of parameters
and are expensive to train and deploy. Moreover, their large size makes parallel
training inefficient due to costly network communication. Likewise, distributing
the model and running it locally on a client platform such as a web browser or
mobile device remains challenging. This thesis investigates ways to approximately
train an NMT system by compressing either the gradients or the parameters for faster
communication or reduced memory consumption. We propose a gradient compression
technique that exchanges only the top 1% of the most significant gradient values while
delaying the rest to be considered for the next iteration. This method reduces the
network communication cost by 50-fold but causes noisy gradient updates. We also
find that the Transformer, the current state-of-the-art NMT architecture, is highly
sensitive to noisy gradients. Therefore, we extend the compression technique by
restoring the compressed gradient with locally-computed gradients. We obtain a linear scale-up
in parallel training without sacrificing model performance. We also explore transfer
learning as a better method of initialising the training. With transfer learning, the model
converges faster and can be trained with more aggressive hyperparameters. Lastly, we
propose a log-based quantisation method to compress the model size. Models are
quantised to 4-bit precision with no noticeable quality degradation when re-training
is combined with preserving the quantisation errors as feedback.
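The gradient compression described above can be read as top-k sparsification with error feedback; the snippet below is a minimal sketch under that reading, assuming dense numpy gradients, with illustrative names. It does not show the extension that restores dropped positions from locally-computed gradients.

import numpy as np

def topk_with_feedback(grad, residual, k_ratio=0.01):
    # Fold in the values dropped at the previous step, then keep only the
    # top `k_ratio` fraction of entries by magnitude for communication.
    acc = grad + residual
    k = max(1, int(k_ratio * acc.size))
    threshold = np.partition(np.abs(acc).ravel(), -k)[-k]   # k-th largest magnitude
    mask = np.abs(acc) >= threshold
    sparse = np.where(mask, acc, 0.0)        # values actually exchanged
    new_residual = np.where(mask, 0.0, acc)  # dropped values kept for the next iteration
    return sparse, new_residual

Keeping the residual locally means dropped gradient information is delayed rather than permanently discarded.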
Mintaka: A Complex, Natural, and Multilingual Dataset for End-to-End Question Answering
We introduce Mintaka, a complex, natural, and multilingual dataset designed
for experimenting with end-to-end question-answering models. Mintaka is
composed of 20,000 question-answer pairs collected in English, annotated with
Wikidata entities, and translated into Arabic, French, German, Hindi, Italian,
Japanese, Portuguese, and Spanish for a total of 180,000 samples. Mintaka
includes 8 types of complex questions, such as superlative, intersection, and
multi-hop questions, which were naturally elicited from crowd workers. We run
baselines over Mintaka, the best of which achieves 38% hits@1 in English and
31% hits@1 multilingually, showing that existing models have room for
improvement. We release Mintaka at https://github.com/amazon-research/mintaka. Comment: Accepted at COLING 2022.
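Hits@1 as reported above is simply the fraction of questions whose top-ranked prediction matches a gold answer; a minimal sketch, with illustrative names:

def hits_at_1(predictions, gold_answers):
    # `predictions`: question id -> ranked list of predicted answers
    # `gold_answers`: question id -> set of acceptable answers
    correct = sum(
        1
        for qid, ranked in predictions.items()
        if ranked and ranked[0] in gold_answers.get(qid, set())
    )
    return correct / len(predictions)

For example, hits_at_1({"q1": ["Paris"], "q2": ["Lyon"]}, {"q1": {"Paris"}, "q2": {"Marseille"}}) evaluates to 0.5.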
LLM-powered Data Augmentation for Enhanced Cross-lingual Performance
This paper explores the potential of leveraging Large Language Models (LLMs)
for data augmentation in multilingual commonsense reasoning datasets where the
available training data is extremely limited. To achieve this, we utilise
several LLMs, namely Dolly-v2, StableVicuna, ChatGPT, and GPT-4, to augment
three datasets: XCOPA, XWinograd, and XStoryCloze. Subsequently, we evaluate
the effectiveness of fine-tuning smaller multilingual models, mBERT and XLMR,
using the synthesised data. We compare the performance of training with data
generated in English and target languages, as well as translated
English-generated data, revealing the overall advantages of incorporating data
generated by LLMs, e.g. a notable improvement of 13.4 accuracy points in the best
case. Furthermore, we conduct a human evaluation by asking native speakers to
assess the naturalness and logical coherence of the generated examples across
different languages. The results of the evaluation indicate that LLMs such as
ChatGPT and GPT-4 excel at producing natural and coherent text in most
languages; however, they struggle to generate meaningful text in certain
languages like Tamil. We also observe that ChatGPT falls short in generating
plausible alternatives compared to the original dataset, whereas examples from
GPT-4 exhibit competitive logical consistency. Comment: EMNLP 2023 Main Conference.
In Neural Machine Translation, What Does Transfer Learning Transfer?
Transfer learning improves quality for low-resource machine translation, but it is unclear what exactly it transfers. We perform several ablation studies that limit information transfer, then measure the quality impact across three language pairs to gain a black-box understanding of transfer learning. Word embeddings play an important role in transfer learning, particularly if they are properly aligned. Although transfer learning can be performed without embeddings, results are sub-optimal. In contrast, transferring only the embeddings but nothing else yields catastrophic results. We then investigate diagonal alignments with auto-encoders over real languages and randomly generated sequences, finding that even randomly generated sequences as parents yield noticeable but smaller gains. Finally, transfer learning can eliminate the need for a warm-up phase when training transformer models on high-resource language pairs.
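As a concrete illustration of the embedding-transfer idea (not the paper's exact procedure), a child model's embedding table can be initialised from the parent's vectors for tokens the two vocabularies share; a minimal sketch assuming token-to-index dicts and numpy arrays, with illustrative names:

import numpy as np

def init_child_embeddings(parent_emb, parent_vocab, child_vocab, dim, seed=0):
    # Copy the parent's vector for every token shared between the two
    # vocabularies ("aligned" embeddings); initialise the remaining rows randomly.
    rng = np.random.default_rng(seed)
    child_emb = rng.normal(scale=0.01, size=(len(child_vocab), dim))
    for token, child_idx in child_vocab.items():
        parent_idx = parent_vocab.get(token)
        if parent_idx is not None:
            child_emb[child_idx] = parent_emb[parent_idx]
    return child_emb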
Bactrian-X: A Multilingual Replicable Instruction-Following Model with Low-Rank Adaptation
Instruction tuning has shown great promise in the field of natural language
processing. However, the research on multilingual instruction tuning has been
limited due to the scarcity of high-quality instruction-response datasets. To
address this gap, we present Bactrian-X, a comprehensive multilingual parallel
dataset of 3.4 million instruction-response pairs across 52 languages.
Leveraging this dataset, we train a set of adapters using low-rank adaptation
(LoRA), which are lightweight components seamlessly integrated with
foundational models. These adapters have a significantly smaller parameter
count than the base model, making them easily replaceable and usable as
plug-ins for different languages or language groups. Through extensive
experiments on 52 languages, we demonstrate the superior performance of our
models in various multilingual evaluation settings. Our proposed models
outperform both the vanilla models and the existing instruction-tuned models.
The code and models are publicly available at
https://github.com/mbzuai-nlp/bactrian-x
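A minimal sketch of using one of these adapters as a plug-in on top of a frozen base model with the Hugging Face transformers and peft libraries; the model identifiers below are placeholders, and the actual base-model and adapter names are listed in the repository:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "base-llm-id"            # placeholder: the foundational model
adapter_id = "bactrian-x-adapter"  # placeholder: a per-language LoRA adapter

tokenizer = AutoTokenizer.from_pretrained(base_id)
base_model = AutoModelForCausalLM.from_pretrained(base_id)

# Attach the lightweight adapter; only the low-rank matrices are loaded,
# so swapping adapters for different languages or language groups is cheap.
model = PeftModel.from_pretrained(base_model, adapter_id)

prompt = "Instruction: Translate 'good morning' into French.\nResponse:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))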